Ecological fallacy and covariates: new insights based on
multilevel modelling of individual data
Michela Gnaldi,
Department of Political Sciences, University of Perugia, Perugia, Italy, michela.gnaldi@unipg.it
Venera Tomaselli,
Department of Political and Social Sciences, University of Catania, Catania, Italy
Antonio Forcina,
Department of Economics, University of Perugia, Perugia, Italy
September 11, 2015
Abstract
This paper deals with the issue of ecological bias in ecological inference. We provide
an explicit formulation of the conditions required for the ordinary ecological regression
to produce unbiased estimates and argue that, when these conditions are violated, any
method of ecological inference is going to produce biased estimates. These ﬁndings are
clariﬁed and supported by empirical evidence provided by comparing the results of three
main ecological inference methods with those of multilevel logistic regression applied to
a unique set of individual data on voting behaviour.
The main ﬁndings of our study
have two important implications that apply to all situations where the conditions for no
ecological bias are violated: (i) only ecological inference methods that allow to model the
eﬀect of covariates have a chance to produce unbiased estimates; (ii) the set of covariates
to be included in the model to remove bias is limited to the marginal proportions. Finally,
our results suggest that, when the association between two ecological variables is very
weak, it is not possible to obtain unbiased estimates even by an appropriate model that
accounts for the eﬀect of relevant covariates.
1
arXiv:1509.03055v1  [stat.AP]  10 Sep 2015

Keywords: Ecological inference, voting behaviour, logistic regression, multilevel models.
1
Introduction
In studies of voting behaviour, an important issue is the shape and strength of association
between the choices of voters in a given election and those in a diﬀerent election. One may
also be interested to study how much voters’ choices depend on their sex, age, education and
social class. However, due to the special nature of electoral data, the true joint distributions
between such pairs of variables cannot usually be observed and the only source of information
can be extracted either from oﬃcial electoral data aggregated at the level of local units, such
as precincts or polling stations, or from sampling surveys.
Inference based on aggregated data has the advantage, relative to those based on sample
surveys, to be inexpensive and to refer to the whole population under investigation. Sample
surveys, on the other hand, though based on direct answers given by voters, may be biased
when the number of non-respondents is not negligible. Additional bias may be caused by
respondents not saying the truth or non remembering what they voted in the last election.
An attempt to compare the estimates provided by the two approaches in the study of the
association between the choices made in two related elections is presented in Russo (2014).
However, because in her survey data the proportions of voters for diﬀerent parties in the
elections seem to be substantially diﬀerent from the corresponding proportions in the oﬃcial
data, it seems diﬃcult to draw valid conclusions from her study where the data about the
true joint distribution are not available. Liu (2007) also compared the performance of sample
surveys to that of several ecological inference methods for estimating the association between
race and propensity to register in a New Orleans mayoral election. He showed that, relative to
the proportions of registered voters provided by true data, most ecological inference methods
did better than survey data in that speciﬁc context.
Since Robinson (1950)’s seminal paper, it is well known that the association between
two variables estimated from data aggregated within geographical units, like polling stations,
may be substantially biased relatively to the association that would emerge if data recorded
2

at the individual level were available. The phenomenon, which came to be known as the
ecological fallacy, was used by Robinson as an argument for banning ecological inference from
sociological investigations. Nowadays, this recommendation has not, perhaps, many followers
due to a better understanding of the conditions which may produce an ecological fallacy
and the emergence of more sophisticated methods of ecological inference.
In addition, as
Subramanian et al. (2009) pointed out, in certain contexts, the degree of association at the
individual level may depend on modelling assumptions and thus may not be such an objective
quantity as Robinson seemed to believe. An important implication of this result is that, when
the results of ecological and individual level studies do not agree, additional investigation may
be necessary before concluding that the ecological estimates are inappropriate.
Though in recent years the reputation of ecological inference has grown within the scien-
tiﬁc community, there is still substantial disagreement about the relative merits of diﬀerent
ecological inference methods.
Because King’s methods (King et al., 1999, 2004) are, per-
haps, the most popular, a big bulk of studies focuses on the merits and shortcomings of these
methods. For example, Freedman et al. (1998) apply King’s model to several datasets where
individual-level data are available. They ﬁnd that the King’s method gives essentially the
same estimates as ecological regression and that both estimates are far from truth. See also
Hudson et al. (2010) who presents a critical reading of ecological inference techniques, using
data coming from New Zealand historic elections, and limited to the 2x2 tables. On a similar
ground, Tam Cho (1998) reports that the point estimates provided by the King’s basic model
and by ecological regression are indistinguishable, after accounting for the standard errors.
Contrary to Freedman et al. (1998) Liu (2007) claims that King’s extended model with co-
variates performs best. As to the extended versions of the King’s method, Freedman et al.
(1998) and Tam Cho (1998) show that the enhancement of the model due to the inclusion of
covariates depends crucially on the speciﬁc covariates included in the model; in other words,
while the inclusion of certain covariates produces ecological estimates which are closer to those
in the individual data, the diagnostics proposed in the King’s framework are of little help in
the choice of the covariates to be included in the model.
3

The objective of this paper is a deeper investigation into the ecological bias and, therefore,
goes beyond an inspection of the King’s methods. First we review and clarify the conditions
required for ecological inference to produce unbiased estimates and provide a simple argument
which shows that the same conditions are required by any method of ecological inference to
avoid bias. This result has two important implication: (i) no simple solution to ecological
fallacy exists, but (ii) the bias may be corrected by modelling the eﬀect of a well deﬁned set
of covariates.
To focus our investigation, we concentrate on three ecological inference methods: the so
called Goodman (1953) model, the ordinary least square (OLS) version of King’s Multinomial-
Dirichlet model (Rosen et al., 2001) and a revised version of the model proposed by Brown
and Payne (1986). The latter model may be seen as a reﬁnement of the Goodman model, in
that transition probabilities are allowed to vary at random between polling stations and may
depend on available covariates.
Empirical support for our ﬁndings is provided by the analysis of an extensive set of indi-
vidual data on voting behaviour from the Democratic Party primary election for the candidate
mayor in the city of Palermo, Italy. The individual records contain information on the de-
cision to vote, and on voters’ sex and age. In the ecological version, voters are aggregated
at the level of polling stations, small local units with an average of about 950 voters. For
these data, all three ecological inference methods lead to estimates which are substantially
diﬀerent from those provided by the individual data which, in our context, do not depend on
modelling assumptions. The detected biases have an explanation along the lines of Wakeﬁeld
(2004): the proportions of people who turn out to vote within each group, deﬁned by sex and
age, are highly correlated with the relative size of these groups. The fact that the estimates
continue to be biases even if we ﬁt ecological inference models where the relative size of such
groups are used as covariates is explained by the fact that the association between ecological
variables is very weak, that is, sex and age are very poor predictors of voting decision.
The paper is organised as follows. In Section 2 we recall the main features of ecological
data, review some methods of ecological inference and explain why, even when individual
4

data are available, the assessment of the nature of the association may depend on modelling
assumptions. In Section 3 the dataset on voting behaviour for the city of Palermo is described.
In Section 4 we analyse the data, clarify why ecological bias is to be expected in this context
and why, even the use of appropriate covariates, can not solve the problem. In Section 5 we
discuss the main results.
2
Methodological aspects
We ﬁrst introduce the notation and describe the models we will apply in the following. Let
Y denote the set of C options available to voters in a given election and X a discrete variable
with R categories which we expect to be associated with Y , like, for instance, age groups, sex,
social class or the choice made in a previous election.
Suppose we are interested in the association between X and Y within a given area, like a
borough divided into a collection of N polling stations. Let nuij denote the number of voters
in polling station u with X = i and Y = j; in most cases, these records, which we may call
individual data, cannot be observed and we have to rely on the aggregated data: xui, the
number of those who voted X = i in the previous election and yuj the number of those who
vote Y = j in the new election. Let puij denote the proportion of voters who choose option
j among those with X = i in polling station u; if X and Y were binary variables, like for
instance when X denotes sex and Y is whether one goes to vote or not, the association may
be measured by the correlation coeﬃcient or the odds ratio. With R × C tables, the degree
of association is 0 when the probabilities puij do not change with i.
2.1
Ecological inference methods
In this paper we focus on three well-known ecological inference methods, by stressing the
diﬀerent assumptions they are based on. Let πij denote the probability that a voter with
X = i choose Y = j.
The so called Goodman (1953) model is essentially based on two rather strong assumptions
according to which (i) the probability that a voter with X = i chooses Y = j does not depend
5

on polling stations and is equal to πij and (ii) voters decide independently from each other.
These assumptions imply that the nu voters in polling station u split among the C options
according to a sum of R multinomial distributions. Under these assumptions no ecological
fallacy can arise and an unbiased estimate of the πij parameters can be computed by OLS
applied to the following set of equations where we write vuj = yuj/nu and tui = xui/nu to
denote the Y and X marginal proportions in each polling stations, respectively.
Because the marginal proportions tui sum to 1, the proportion for X = R, that is tuR,
may be replaced with 1 −PR−1
1
tui
vuj =
R
X
i=1
tuiπij + ϵuj = πRj +
R−1
X
i=1
(tui −tuR)πij + ϵuj.
(1)
Note also that, because the probabilities πij also sum to 1 within each row, the equation for
the last column where j = C is redundant.
A collection of methods proposed by Gary King and his co-workers - see, for instance,
King (1997), King et al. (1999) and Rosen et al. (2001) - have become very popular within
certain scientiﬁc communities where they are considered to be the most advanced methods
of ecological inference, though their merits are debated, see Liu (2007) for a review. The
well-known basic King model (King, 1997; King et al., 1999) incorporates three assumptions.
First, the parameters are not constant and their variation can be described by a truncated
bivariate normal distribution. Second, there is no ”aggregation bias” that is the parameters
are assumed to be uncorrelated with the regressors. Third, the data do not exhibit any spatial
autocorrelation. In spite of their sophisticated Bayesian framework, the above assumptions,
when translated into models of voting behaviour, are rather simple. For instance, the model
described in Rosen et al. (2001), imply that the probability that a voter in polling station
u chooses option j equals P
i nuipuij. This is equivalent to assume that all voters in polling
station u behave as an homogeneous group, irrespective of their value of X, an assumption
which seems rather unrealistic, especially in the context of voting behaviour when X is the
party voted in a previous election occasion; a similar criticism was raised by Greiner and
Quinn (2009).
6

The model proposed by Brown and Payne (1986) may be seen as a reﬁnement of the
Goodman model both in the assumptions on voting behaviour and in the method used to
estimate the unknown parameters.
More precisely, the model assumes that a voter, who
lives in polling station u and had chosen X = i, will choose the voting option Y = j with
a probability which is no longer the same in all polling stations, but may vary at random
as in a Dirichlet distribution whose average, which we may call again πij, is common to all
polling stations, if covariates are not available. In addition, because parameters are estimated
by maximum likelihood on a logistic scale, the estimates are more eﬃcient and lie always
between 0 and 1. Recently, Forcina et al. (2012) have proposed a modiﬁed version of the
above model by assuming that voters who share the same transition probabilities are only
those who have the same X = i, live in the same local unit and, in addition, belong to the
same circle of friends or relatives. This is equivalent to assume that the decisions of voters are
correlated only within smaller clusters whose size may vary at random. This model, which we
will apply in the following, diﬀers from the Brown and Payne model only for a component in
the expression of the covariance matrix of the observations; readers interested in the technical
details, are referred to Forcina et al. (2012).
2.2
The ecological fallacy revisited
In spite of the impact that Robinson’s paper had on the scientiﬁc community, the true nature
of ecological bias is not always well understood. For instance, according to Russo (2014), the
ecological fallacy is due to the ”incorrect assumption that individual members of a group have
the same characteristics as those of the group taken as a whole”, a statement which is too
vague to be meaningful, unless the Author wanted to say that ecological inference is based
on the assumption that transition probabilities, within each local unit, are the same as in the
whole town to which they belong.
Though the conditions required to prevent ecological bias in King’s models are clearly
stated, see for instance King et al. (1999), the title of one of his ﬁrst papers ”A solution to
the ecological inference problem” (King, 1997) may have lead to believe that these methods
7

have some kind of intrinsic protection against the ecological fallacy; for instance (Seligson,
2002) seems to be convinced that Gary King has advanced ”towards solving the ecological
inference problem” as long as we have ”relatively homogeneous ecological units”, a rather
vague statement, though, somehow, closer to the truth.
Perhaps, one of the simplest explanations of the mechanism underlying the ecological
fallacy is provided by Wakeﬁeld (2004) in the case where X and Y are binary variables and
there are only two local units (for example, polling station 1 and 2); a numerical example
along those lines is given in Table 1. Moving from the ﬁrst to the second polling station, the
proportion of males increases from 0.2 to 0.6 and, at the same time, the overall proportion of
those who turn out to vote decreases from 0.7 to 0.4; thus, any method of ecological inference
would lead us to conclude that the proportion of those who turn out to vote is smaller among
males relative to females. On the other hand, if we look at the joint distribution within each
local unit, we see that the proportion of those who go to vote is higher for males relative to
females in each polling station.
INSERT TABLE 1 HERE TABLE 1: Example of ecological fallacy with two polling
stations
The apparent paradox has been produced intentionally as follows: when we go from the
ﬁrst to the second polling station the proportions of those who turn out to vote decrease both
among males (from 0.15/0.20 to 0.30/0.60) and among females (from 0.55/0.80 to 0.10/0.40)
while, at the same time, the marginal proportion of males increases (from 0.2 to 0.6). In other
words, the proportions of voters within males and females are not constant and, in addition,
their variations are correlated with the proportions in the row totals.
To examine the case where we have a collection of N polling stations with an R × C table
of association, it is convenient to start with the Goodman (1953) linear regression model.
The model assumes that voters with X = i choose option Y = j with a probability πij,
constant across polling stations, where voters choose independently from one another. These
assumptions imply that the nui voters in polling station u with X = i split among the C
options according to a multinomial distribution. It is useful to derive the set of equations for
8

this model in order to obtain an explicit expression for the error term. We may transform the
accounting equation, with simple algebra, to obtain:
vuj =
R
X
i=1
tuipuij =
R
X
1
tuiπij +
R
X
1
tui(puij −πij),
(2)
the second component in the equation above is the error term, which we may denote by ϵuj.
By using again the fact that tuR = 1 −PR−1
1
tui, we may write:
vuj = πRj +
R−1
X
i=1
(tui −tuR)πij + ϵuj;
(3)
this is the basic equation in Goodman linear regression model. The condition for the least
square estimates to be unbiased, see for exampleWooldridge (2004), is that ϵuj is uncorrelated
with the tui proportions which are the explanatory variables in the regression model; this
condition, in turn, implies, for any i, that:
E [(puij −πij) | tu1, . . . , tuR] = 0¯.
(4)
We may summarize the above by saying that a necessary and suﬃcient condition for
ecological regression to provide unbiased estimates is that the puij proportions are uncorrelated
with the marginal proportions tui. This means that the condition for the linear regression
model to provide unbiased estimates are satisﬁed even if the puij vary at random (like in the
Brown and Payne model) or depend upon other variables, as long as these other variables are
uncorrelated with the set of marginal proportions tui.
Whether or not the above condition is satisﬁed, when we ﬁt a regression model like Good-
man’s, the residuals will always be uncorrelated with the marginal row proportions tui, thus,
when only ecological data are available, it is not possible to check the condition for no ecolog-
ical bias. This is possible, instead, when the joint distribution of X and Y is observed in each
polling station. In particular, when, like in our case, Y is binary, a logistic regression model
can be ﬁt to test whether the observed proportions puij depend on the marginal proportions
9

tui.
2.3
Individual data, multilevel models and the role of covariates
We now discuss the issues raised by Subramanian et al. (2009), that is whether individual data,
if available, provide always simple and objective estimates of the conditional distributions of
Y , the voting decision, given X, the predictor. Suppose ﬁrst that no covariates are available
and that, though the proportions tui of voters with X = i vary at random among polling
stations, the voting behaviour conditionally on X is the same. In this simple context, we may
sum the joint frequencies nuij with respect to polling stations and estimate the conditional
probabilities as:
ˆπij =
P
u nuij
P
u
P
j nuij
.
(5)
then, these estimates may be compared with those provided by ecological inference. This
is essentially what Robinson did with his census data on race and illiteracy: he computed
an estimate of the conditional probabilities for being illiterate within whites and blacks. To
summarize his results in a more striking way, he also computed the correlation coeﬃcient in
the corresponding 2×2 table and compared it with the one based on conditional probabilities
estimated by ecological regression with the States as local units.
When the conditional probabilities of choosing Y = j given X = i are not constant, but
vary as a logistic function of covariates measured at the level of polling stations, the degree of
association will also vary across polling stations. If, like in our case, Y is binary, one could try
to ﬁt a logistic model and then estimate the degree of association conditionally on a speciﬁc
value of the covariates which is of interest, like their average, or compute an estimate of the
conditional probabilities for each polling station and the average. Given the non linearity of
the logistic model, the second method should be preferred (Greiner and Quinn, 2009).
Subramanian et al. (2009) has studied in detail the case where, in addition to covariates,
there is also substantial random variation across local units. They ﬁt several diﬀerent logistic
multilevel models to the same set of individual data used by Robinson, the 1930 US Census
data. These are individual data by State with adults classiﬁed as illiterate or not and as
10

”white born in US”, ”white not born in US” and ”black”. Their results indicate that the
models which take into account the additional variability across States (their local units) ﬁt
much better than the model based on the naive estimates given in (5); the more sophisticated
models give a substantially diﬀerent picture of association between race and illiteracy.
3
The Palermo dataset: collection and preliminary analyses
On March 4th 2012, the Democratic Party (PD) held its Primary election to choose the
candidate for the mayoral election of the city of Palermo, Italy. Four candidates run for that
position within the centre-left coalition. The voter turnout, almost 30,00 voters, was higher
than in previous primary elections.
The individual dataset used in this paper was constructed as described next. The oﬃcial
electoral rolls of the municipality of Palermo in the 2012 spring listed 564,405 eligible voters
grouped into 593 valid polling stations after excluding those based on temporary communities
like hospitals and prisons. These records provided gender, date of birth and address, for each
eligible voter.
The PD Primary election was held in 31 voting seats, each made of a suitable collection
of neighbouring polling stations. Each eligible voter was then listed in the PD records for
the Primary election by seat and polling station as ‘voter’ or ’non voter’. The individual
data were obtained by matching the two above lists, eligible voter by eligible voter, and then
aggregating by polling station and seat.
An overview of the data is given in Table 2: though the number of participants was
slightly more than 28 thousands, that is about 5% of the eligible voters, participation should
be compared to the strength of the centre-left coalition, made of PD and a few minor parties.
An assessment of their performance may be estimated from the results of the 2012 mayoral
election which took place on May 6th. Though by that time the centre-left coalition had split
into two diﬀerent coalitions, overall they obtained over 16% of the eligible voters and over
18% in the residential area. Almost 31% of the potential participants, estimated from the
mayoral election, voted at the Primary election with the highest participation rate recorded
11

in the suburban area (over 33%).
INSERT TABLE 2 HERE TABLE 2: Voter turnout at the 2012 Primary elections by
areas
In the present study, these individual data may be used to check whether the assumptions
required to apply ecological estimation models are, or are not, violated. The data also allow
us to assess the eﬀects of covariates within a multilevel framework that takes into account the
diﬀerent sources of variability (Hox, 2010; Goldstein, 2003; Rodrigez and Goldman, 1995).
Our data has a three level hierarchical structure: individual voters (as ﬁrst level units)
nested within polling stations (as second level units), and polling stations nested within seats
(as third level units). In multilevel settings, covariates may express voters features as well
as polling station properties, so that we are dealing with ﬁrst and second level covariates,
respectively (Snijders and Bosker, 2012). Speciﬁcally, in this study the ﬁrst level covariates
employed are sex and age groups; the second level covariates are two measures of partisan
leaning and two further measures expressing speciﬁc proportions of voters per sex within age
groups.
Table 3 provides further insights into the data, with voters classiﬁed by sex and age. The
distribution of eligible voters by sex is relatively balanced, even if there are slightly more
women (298,773) than men (265,632). Eligible voters are more frequent in the middle age
groups (30-65 years).
If we look at the voter turnout within age groups, we note that voter shares do not diﬀer
very much between males and females. However, within the 65-75 group, males vote 1.53
times more than females (0.0654 male proportion and 0.0427 female proportion). Besides, in
the 75 and over group, males vote 2.23 times more than females (0.0345 for men and 0.0156
for women). If we compare the 45-65 and the 18-25 age groups, we note that the vote share
is 1.56 times higher for men (0.0688 against 0.0442) and 1.48 times higher for women (0.0647
against 0.0435).
To sum up, all the proportions of voters by sex and age show that, tough eligible female
voters (298,773) are more represented in absolute terms than men (265,632), men turn out to
12

cast a vote more than women.
INSERT TABLE 3 HERE TABLE 3: Eligible voters and voters by sex and age
With the aim to ﬁt multilevel models, we employed four covariates measured at the level
of polling stations:
• pd, the proportion of voters for the Democratic Party at the municipal election held a
month later (3,8% on average for all the polling stations within total eligible voters);
• idv, the proportion of voters for the Italia dei Valori Party at the same municipal election
(5,1% on average for all the polling stations within total eligible voters);
• mol, the proportion of males aged between 45 and 74 (45.0% within male eligible voters);
• fol, the proportion of females aged between 45 and 74 (45.4% within female eligible
voters).
4
The data analysis
4.1
Logistic multilevel models
We ﬁtted a set of logistic regression models similar to those in Subramanian et al. (2009) to
estimate the size and structure of random variations among polling stations in the propensity
to vote and to see how this aﬀects the estimates of the ﬁxed eﬀects. For each polling station,
voters are divided according to sex and to 6 age groups as in Table 3. For the multilevel
analysis, our data consist of 12 binomial observations: voters and non voters for each of the 6
age groups within each sex, nested within each polling station with polling stations grouped
into 31 seats. In this model, there are three sources of variation:
(i) binomial within polling stations;
(ii) among polling stations within seats with a standard deviation of 0.2311;
(iii) among seats with a standard deviation of 0.2547.
13

The above multilevel logistic model provides estimates of the probability to go to vote within
each sex by age group which depends on the covariates and thus are diﬀerent in each polling
station. An overall estimate may be computed by averaging the estimates obtained within each
of the 593 polling stations. In Figure 1 we have plotted the estimates provided by the multilevel
model with covariates together with the raw estimates as in (5) which ignore covariates and
random variation; the two sets of estimates seem to be in rather close agreement.
INSERT FIGURE 1 HERE. FIGURE 1: Estimates of the probability to go to vote; age
groups on the x axis, females in the left panel, males on the right; o marks raw estimates and
△those of the multilevel model
It can be seen that the three diﬀerent estimates of voting probabilities are very similar
within each sex by age group. Thus, in this context, there is not much ambiguity in deter-
mining the pattern of association between sex-age and propensity to vote on the basis of the
data at the individual level. In other words, our data provide reliable targets with which to
assess the accuracy of the estimates provided by ecological inference.
For the Robinson’s data, Subramanian et al. (2009) found that the model (their Model 4)
which allows the random eﬀect due to states to be diﬀerent among their 3 groups by race and
birth ﬁtted much better and produced diﬀerent estimates. In our context, this is equivalent to
assume that the random variation due to polling stations and seats is diﬀerent for each age by
sex group. However, because the routine for estimating this model had convergence problems,
probably due to the presence of three nested random eﬀects, 12 categories in addition to
covariates, we ﬁtted a separate model for each age group. The estimates of the size of the
random eﬀects due to polling stations and seats within each age group are displayed in Table
4. Random variation across polling stations within seats, and across seats seem to be of the
same order of magnitude. Though this model is much more complex, the improvement in the
ﬁt is rather modest.
INSERT TABLE 4 HERE TABLE 4: Standard deviations of the random eﬀects among
polling stations within seats and among seats for each age group
14

4.2
Ecological inference estimates
The estimates of voting probabilities provided by three diﬀerent ecological inference methods
without covariates are displayed in Table 5.
It is easily seen that they are substantially
diﬀerent from those based on individual data in Table 3: for certain age groups, estimated
probabilities are close to 0 while those for other age groups are much too higher than those
provided by all methods of estimation based on individual data displayed in Figure 1. Note
also that the estimates provided by the King OLS and the revised Brown-Payne methods are
rather similar to one another.
INSERT TABLE 5 HERE TABLE 5: Ecological inference estimates of the probability of
voting by sex and age groups, without covariates
Because the results of Section 2.2 tell us that bias should not be present unless the propen-
sity to vote within each group of sex by age is correlated with the relative size of these groups,
we may use the individual data to check the size and the direction of these correlations.
Consider the quartile regressions plotted in Fig 2 where polling stations are grouped, as an
example, on the basis of the proportion of eligible voters aged 65-75. The average proportion
within each quartile is plotted on the x axis against the corresponding average proportion of
voters. Most lines have a positive slope indicating that the propensity to vote increase with
the proportion of eligible voters aged 65-75. A similar picture would emerge if we grouped
polling stations on the basis of the proportion of voters aged 45-65; instead, if we use as
explanatory variable the proportion of voters aged 25-30 or 30-45, regression lines have a
roughly negative slope.
INSERT FIGURE 2 HERE FIGURE 2: Proportion of voters by age groups as a function
of the proportion of eligible votes aged 65-75, Females on the left and males on the right,
+=18-25, ×=25-30, ∗=30-45, □=45-65, ♦=65-75, ▽=over 75
Further evidence is provided by a more formal procedure based on ﬁtting logistic multilevel
models separately for each age group, where the observation at the ﬁrst level are the number
of voters (classiﬁed by sex and age category) nested within polling stations nested, again,
within seats. As covariates we considered the proportion of eligible voters belonging to each
15

age group separately for males and females, in addition to the pd and idv covariates described
above. An informal model selection procedure was used to select which covariates should be
included.
Though the proportion of voters aged 45-65 and 65-75 were signiﬁcant most of the times,
when pd and idv were also used, some of the previous covariates appeared to have no longer
a signiﬁcant eﬀect. This could be due to the fact that pd and idv are closely related to the
age distribution within each polling station. More precisely, when either pd or idv increases,
the proportion of eligible voters in the 18-45 age group decreases while the proportion in the
age range from 45 to 75 and over increases. The parameter estimates are displayed in Table
6.
INSERT TABLE 6 HERE: TABLE 6: Estimated parameters for the multilevel logistic
models
The fact that, most of the times, two or more of the covariates are highly signiﬁcant
indicates that the probabilities to vote for the corresponding age group are strongly correlated
with the marginal proportions and thus consistent estimates cannot be obtained by ecological
inference in this speciﬁc context. The same conclusion is implied by the signiﬁcance of the pd
or idv covariates which are correlated with the marginal distribution of eligible voters.
One may wonder whether ecological estimates that are closer to the truth may be obtained
by using appropriate covariates.
Liu (2007) found that the estimates from King’s model
improved substantially by including certain covariates. We have ﬁtted a revised Brown and
Payne model (see Table 7) where we allow the probability of voting for each sex by age group
to depend on the same set of covariates which appeared to have a signiﬁcant eﬀect in the
logistic multilevel models described above.
Unfortunately, the resulting estimates are not
much better than those given in Table 5.
INSERT TABLE 7 HERE: TABLE 7: Revised Brown and Payne model: estimates for the
probability of voting by sex and age groups with covariates
To gain a better understanding of what is happening, we tried to compare the predictions
provided by the revised Brown and Payne model when covariates are included with those
16

obtained from the logistic multilevel models based on individual data. The actual procedure
is describe below:
• for each polling station we compute the overall number of voters predicted by both the
Brown and Payne model with covariates and by the set of multilevel models (one for each
age group). These predictions were compared with the observed values and the standard
deviation of the error for the two methods was computed. This is equal to 15.45 for the
Brown-Payne model and 15.94 for the multilevel models, thus, ecological inference with
suitable covariates predicts the overall number of voters even more accurately than the
multilevel models, which use individual data
• both estimation methods provide also an estimate of the number of voters classiﬁed by
age and sex within each polling station and we may compare again these predictions with
the observed frequencies and compute the standard deviation of the errors within each
sex by age group for the two methods. Here the picture is quite diﬀerent: the multilevel
models provide a substantially better prediction of the number of voters classiﬁed by sex
and age, as displayed in Figure 2: though for certain groups the two errors are similar,
in others ecological inference estimates have a much larger error.
INSERT FIGURE 3 HERE FIGURE 3: Standard deviation in the prediction of the num-
ber of voters in each sex by age groups; age groups are on the x axis, females on the left panel;
o stands for ecological inference and △for individual data
5
Discussion of main results
This paper gains a deeper understanding of ecological bias by combining theory and empirical
evidence based on the analysis of an extensive set of individual data on voting behaviour
from a large sized Italian town. Individual records available for all eligible voters contain
information on sex, age and whether they went to vote at the 2012 Democratic Party primary
election, in addition to the polling station where the voter is enrolled. Thus, in this study the
true joint distributions of eligible voters by sex, age group and the choice to vote or not at
17

the Primary election is known for each polling station.
To uncover the true structure of association from our individual data, we ﬁt a logistic
multilevel model with a single random intercept, and a further model with as many random
intercepts as the age groups in our data, as in Subramanian et al. (2009). Here, voters are
1-level units nested within polling stations (as 2-level units) nested within seats (3-level units).
We show that, relatively to our context of study, the pattern of association is unequivocal as
the two measures of voting probabilities (averaged across polling stations and for an average
polling station, respectively) are very similar one another and similar to the raw proportion
estimates, within each sex by age group, too. The fact that the pattern of association in our
data is not model dependent as in the Robinson’s data analysed by Subramanian may be due
to the fact that, while our data concern a single (though large) town, he was dealing with
a big country. In addition, our local units (polling stations) are of very small size while in
Subramanian’s case they were very large territorial areas.
We apply three ecological inference methods (Goodman’s linear regression, a modiﬁed
version of the Brown and Payne model and the OLS version of King Dirichlet-Multinomial
model) and compare ecological estimates with the true proportions in the individual-level
data. We ﬁnd that all the three methods provide estimates which are substantially diﬀerent
from the true proportions. An explanation for this is provided by our results in Section 2
which were, partly, anticipated by Wakeﬁeld (2004): the proportion of people who go to
vote within each group, by sex and age, are highly correlated with the relative size of these
groups. Informal evidence for these correlations is provided by quantile regression. To check
formally for this, we ﬁt logistic multilevel model where, among other covariates, we consider
the proportion of eligible voters belonging to each sex by age group - that is, the group sizes
- and two additional measures of partisan leaning. We ascertain that some covariates are
highly signiﬁcant, so that the probabilities to vote for the corresponding sex by age group are
strongly correlated with the marginal proportions obtained by ecological inference. Thus, the
conditions to obtain unbiased estimates are not satisﬁed and this is going to aﬀect any of the
three ecological inference methods considered here.
18

The theory in Section 2 indicates that the only possibility to correct ecological bias is
to model the eﬀect of covariates.
This is what Liu (2007) did in his study, ﬁnding that
the estimates from the King’s model improved substantially by including certain covariates.
However, while Liu was searching among all possible covariates, our work indicates how to
determine which are the potentially relevant covariates by ﬁtting a logistic multilevel model
to the individual data. The interesting result here is that, while the extended version of the
Brown and Payne model with covariates does provide a very accurate ﬁt of the total number
of voters in each polling station, the estimated proportion of voters by sex and age are not
much better than those obtained by the same model without covariates.
We argue that this is due to the fact that the association between sex and age on one
side and voting decision on the other is very weak. When this happens and, in addition,
the response variable (voting decision) is binary, the aggregated data provide insuﬃcient
information to disentangle the true structure of association from ecological data. In other
words, even if we manage to adjust for the sources of ecological fallacy by ﬁtting ecological
inference models with covariates, this may not be suﬃcient to obtain reliable estimates. It may
happen, like in our data, that while the probability to vote for a certain age group increases
with a covariate, it may decrease for another age group relative to the same covariate in such
a way that these two eﬀect partly compensate at the aggregated level. It follows that our
ecological inference model may predict very well the variations of the aggregate proportion
and, at the same time, provide rather poor estimates of the probability of voting within each
age by sex group. This last feature goes beyond ecological fallacy. Subsequent analyses show
in fact that ecological inference with suitable covariates predicts the overall number of voters
in each polling station more accurately than multilevel models based on individual data.
However, multilevel models provide a better prediction within each group of the number of
voters classiﬁed by sex and age. These last results suggest that, even if we manage to adjust
for the sources of ecological fallacy by ﬁtting ecological inference models with covariates, this
may not be suﬃcient to obtain reliable estimates. It may happen, like in our data, that while
the probability to vote for certain categories increases with a covariate, it may decrease for
19

other categories of the same covariate in such a way that these two eﬀects partly compensate
at the aggregated level. It follows that our ecological inference model may predict very well the
variations of the aggregate proportion and, at the same time, provide rather poor estimates
of the probability of voting within each age by sex group.
Overall, the enhancement of the current state of the art provided by the present study is
twofold. On one side, we clarify the conditions under which the estimates provided by any
method of ecological inference are likely to be biased. In so doing, we point out that when
voter’s choices are modelled without accounting for the eﬀects of covariates, they provide
far from truth results.
On the other hand, our results indicate clearly where to look for
appropriate covariates which, once taken properly into account, should, remove bias. Except
in special contexts, like in the Palermo data, when the association between the two ecological
variables of interest is very weak and the response variable is binary. In other terms, even
ecological models which are both based on plausible assumptions and extended to account for
the eﬀects of signiﬁcant covariates aﬀecting vote choices, do not necessarily provide reliable
estimates.
Acknowledgment
For the Palermo primary data, special thanks to Giovanni Barbagallo, member of the Sicilian
Regional Parliament of the Democratic Party.
20

References
Brown, P. and Payne, C. (1986) Aggregate data, ecological regression and voting transitions.
Journal of the American Statistical Association, 81, 453-460.
Forcina, A., Gnaldi, M. and Bracalente, B. (2012) A revised Brown and Payne model of voting
behaviour applied to the 2009 elections in Italy. Statistical Methods and Applications, 21,
109-119.
Freedman, D.A., Klein, S.P., Ostland, M. and Roberts, M.R. (1998) A solution to the eco-
logical inference problem by G. King. Reviewed by Freedman D.A. Klein S.P. Ostland M.
Roberts M.R. Journal of the American Statistical Association, 98, 1518-1522.
Goldstein, H. (2003) Multilevel Statistical Models. London: Arnold.
Goodman, L. A. (1953) Ecological regression and the behaviour of individuals. American
Sociological Review, 18, 351-367.
Greiner, D.J. and Quinn, K.M. (2009) R x C ecological inference: bounds, correlations,
ﬂexibility and transparency of assumptions. Journal of the Royal Statistical Society: Series
A, 172, 67-81.
Hox, J. (2010) Multilevel analysis: Techniques and applications. N.Y.: Routledge.
Hudson, I. L., Moore, L., Beh, E. and Steel, D.G. (2010). Ecological inference techniques:
an empirical evaluation using data describing gender and voter turnout at New Zealand
elections, 1893-1919. Journal of the Royal Statistical Society: Serie A, 173(1), 85-213.
King, G. (1997). A Solution to the Ecological Inference Problem: Reconstructing Individual
Behavior from Aggregate Data. Princeton University Press, Princeton, NJ.
King, G., Rosen, O. and Tanner, M. A. (1999) Binomial-Beta hierarchical models for ecological
inference. Sociological Methods & Research, 28, 61-30.
King, G., Rosen, O. and Tanner, M. A., Eds, (2004). Ecological Inference. Cambridge Uni-
versity Press, Cambridge.
21

Liu, B. (2007) EI extended model and the fear of ecological fallacy. Sociological Methods &
Research, 20, 1-23.
Robinson, W.S.(1950) Ecological correlations and the behaviour of individuals. American
Sociological Review, 15, 351-357.
Rodriguez, G. and Goldman, N. (1995) An assessment of estimation procedures for multilevel
models with binary responses. Journal of the Royal Statistical Society: Serie A, 158, 73-90.
Rosen, O., Jiang, W., King, G. and Tanner, M.A. (2001) Bayesian and frequentist inference
for ecological inference: the RxR case. Statistica Neerlandica, 55, 134-156.
Russo, L. (2014) Estimating ﬂoating voters: a comparison between the ecological inference
and the survey methods. Quality and Quantity, 48, 1667-1683.
Seligson, M.A. (2002) The renaissance of political culture or the renaissance of the ecological
fallacy? Comparative Politics, 34, 273-292.
Snijders, T. and Bosker, R. (2012) Multilevel Analysis: An Introduction to Basic and Advanced
Multilevel Modeling. London: Sage.
Subramanian, SV., Jones, K., Kaddour, A. and Kreiger, N. (2009) Rivisiting Robinson: the
perils of individualistic and ecologic fallacy. International Journal of Epidemiology, 38,
342-360.
Tam Cho, W.K. (1998.) If the assumption ﬁts: a comment on the King ecological inference
solution. Political Analysis, 7(1), 143-163.
Wakeﬁeld, J. (2004) Ecological inference for 2 × 2 tables. Journal of the Royal Statistical
Society: Series A, 167, 1-42.
Wooldridge, J.M. (2004) Econometric Analysis of Cross Section and Panel Data. The MIT
Press, Cambridge, Massachusetts.
22

Table 1:
Example of ecological fallacy with two polling stations
Polling station 1
Polling station 2
No
Yes
Total
No
Yes
Total
F
0.25
0.55
0.80
0.30
0.10
0.40
M
0.05
0.15
0.20
0.30
0.30
0.60
Tot.
0.30
0.70
1.00
0.60
0.40
1.00
Table 2: Voter turnout at the 2012 Primary elections by areas
Primary
Primary
Mayoral
Mayoral
Area
Polling
Eligible
Voters
Centre-Left
Eligible
b/c
a/b
stations
voters
(a)
votes (b)
voters (c)
×100
×100
City Center
114
107389
4406
16575
107252
15.45
26.58
Residential
180
166560
9429
30768
166214
18.51
30.65
Suburban
299
290456
14961
45216
290149
15.58
33.09
OVERALL
593
564405
28796
92559
563615
16.42
31.11
Table 3:
Eligible voters and voters by sex and age.
18-25
25-30
30-45
45-65
65-75
over 75
Totals
Eligible voters by sex and age
Males
32,417
22,015
70,091
89,798
29,725
21,586
265,632
Females
30,490
21,109
73,397
99,564
36,201
38,012
298,773
Voters by sex and age
Males
1,434
986
3,351
6,179
1,945
745
14,640
Females
1,327
986
3,268
6,438
1,545
592
14,156
Proportions of voters by sex and age
Males
0.0442
0.0448
0.0478
0.0688
0.0654
0.0345
265,632
Females
0.0435
0.0467
0.0445
0.0647
0.0427
0.0156
298,773
Table 4:
Standard deviations of the random eﬀects among polling stations within seats and
among seats for each age group
18-25
25-30
30-45
45-65
65-75
over 75
All together
Poll. stations
0.2482
0.2588
0.2628
0.2450
0.2957
0.3339
0.2311
Seats
0.2589
0.1906
0.2321
0.2730
0.2782
0.2591
0.2547
23

Table 5:
Ecological inference estimates of the probability of voting by sex and age groups,
without covariates
Sex
Age groups
Method
18-25
25-30
30-45
45-65
65-75
over 75
Goodman
M
0.000
0.000
0.000
0.123
0.147
0.000
F
0.000
0.000
0.000
0.124
0.000
0.028
Brown-Payne
M
0.000
0.000
0.000
0.000
0.278
0.000
(revised)
F
0.000
0.000
0.000
0.148
0.000
0.152
King OLS
M
0.001
0.001
0.000
0.002
0.264
0.007
F
0.000
0.001
0.000
0.144
0.004
0.161
Table 6:
Estimated parameters for the multilevel logistic models for the propensity to vote;
F is the intercept within females, M −F is the diﬀerence in intercept between males and
females; ◦= non signiﬁcant, ⋆= 5% signiﬁcant, ∗= 1% signiﬁcant, • = p-value smaller than
0.001
.
Parameters
Age groups
18-25
25-30
30-45
45-65
65-75
over 75
F
-4.9207•
-4.5096•
-4.4335•
-3.6369•
-4.7301•
-5.0218•
pd
13.1413•
14.8040•
8.7711•
12.4372•
7.9891•
7.1119•
idv
4.5733⋆
0.0000◦
4.2559∗
5.2272•
4.0019⋆
10.3799•
P(45 −65)
2.3815∗
2.6580∗
1.6830∗
0.0000◦
2.0258⋆
0.0000◦
P(65 −75)
2.3888∗
0.0000◦
1.9564∗
1.3247∗
2.9426•
0.0000◦
M −F
0.0256◦
-0.0570◦
0.0853•
0.0898•
0.4785•
0.8171•
Table 7:
Revised Brown and Payne model: estimates for the probability of voting by sex
and age groups with covariates
Sex
Age groups
18-25
25-30
30-45
45-65
65-75
over 75
M
0.000
0.001
0.000
0.241
0.110
0.053
F
0.000
0.000
0.000
0.027
0.000
0.000
24

Figure 1: Estimates of the probability to go to vote; age groups on the x axis, females in the
left panel, males on the right; o marks raw estimates and △those of the multilevel model
Figure 2: Proportion of voters by age groups as a function of the proportion of eligible votes
aged 65-75, Females on the left and males on the right, +=18-25, ×=25-30, ∗=30-45, □=45-
65, ♦=65-75, ▽=over 75
25

Figure 3: Standard deviation in the prediction of the number of voters in each sex by age
groups; age groups are on the x axis, females on the left panel; o stands for ecological inference
and △for individual data
26
